Expanded examinations of a low frequency modulation feature for speech/music discrimination
Abstract
A low frequency modulation feature, LFMAD, was examined under several conditions with regard to its robustness in speech/music discrimination. The feature was tested on LF components from 2 Hz to 27 Hz and with different analysis window sizes. It performs best with an analysis window that contains exactly one period of the LF component in question. When the music contained a large proportion of vocals, the error rate in the speech/music discrimination task increased compared with purely instrumental music. This effect was found for LFMAD as well as for the MFCC feature, which was used for comparison. Tests were also carried out with signals in additive noise from 30 dB down to 0 dB SNR. LFMAD performed better than MFCC in these tests, and the error rate was higher for speech signals. There was a bias towards classifying data as music when the test conditions diverged from the training conditions; this effect was less pronounced for LFMAD than for MFCC. The best results in this study were obtained by combining LFMAD and MFCC into a mixed feature. This appears to be a more robust feature for speech/music discrimination and can be recommended when scanning databases of unknown quality for speech events.

1. Background and report layout

The speech/music discrimination task has been examined by several authors [1, 2, 3, 4] with different approaches. The LF modulations of speech and of music show different behaviour, and an LF modulation feature can therefore be used to discriminate between them [1]. The feature used there, 4 Hz ASD, is examined further here in several respects. The following investigations are presented in this paper:

• Comparison between the different LFMAD-n features (Low Frequency Modulation Amplitude and Deviation, n Hz) and some combinations of them, including a study of different window sizes
• The effect of vocals in the music
• The effect of additive noise in the test data

2. Database, features and model

The database used in these experiments is the same as in the previous paper [1], extended with vocal music and choir music.

Table 1. Sound database overview. "Nr" means the number of speakers or the number of music pieces.

                 Training       EER            Test
Sound class      Minutes  Nr   Minutes  Nr    Minutes  Nr
Speech           17       49   8        19    15       48
Instrumental     16       53   7        23    15       48
Vocal            19       41   10       21    14       31
Choir            14       9    --       --    15       10
Mixed            16       51   (8)      (25)  11       38

Speech training and EER data are samples from the Waxholm database [5]. The EER data are used to calculate a threshold for the score that gives the Equal Error Rate on the EER data when the models trained on the training data are used. This threshold is also applied to the test data whenever a decision between only two classes is to be made. The music database consists of four parts: instrumental, vocal, choir and mixed music. All music training and EER data are samples from CD recordings. Mixed music contains both instrumental and vocal pieces, and the choir music is a cappella. The mix of male/female speakers was approximately 75%/25% for the training and EER data and 50% each for the test data.

The test data (except for the choir data) were collected from Swedish broadcasts during January and February 2001 using a standard FM receiver. The speech contained a variety of speaking styles, and the music represented different styles such as pop, rock, country and classical music. The choir test data were all collected from CD recordings. All data were sampled at 16 kHz with 16 bits in mono.

As a standard feature for comparison, the 39-MFCC feature was used: 13 Mel frequency cepstrum coefficients computed with a 32 ms Hamming window, together with their delta and delta-delta coefficients calculated over 100 ms by linear regression, giving a 39-dimensional vector.
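To make the comparison feature concrete, the sketch below computes such 39-dimensional vectors with librosa, which is not the implementation used in the paper; the 50% frame overlap and the 7-frame delta width are assumptions, since the paper specifies only the 32 ms Hamming window and the 100 ms regression span.

```python
# A sketch of the 39-MFCC comparison feature using librosa (not the
# paper's implementation). Frame hop and delta width are assumptions.
import numpy as np
import librosa

def mfcc39(signal, sr=16000):
    """13 MFCCs from 32 ms Hamming frames plus delta and delta-delta
    estimated by local linear regression over roughly 100 ms."""
    n_fft = int(0.032 * sr)                 # 32 ms window = 512 samples
    hop = n_fft // 2                        # 50% overlap (assumed), 16 ms hop
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13, n_fft=n_fft,
                                hop_length=hop, window="hamming")
    # librosa's delta fits a local polynomial (Savitzky-Golay); a width of
    # 7 frames at a 16 ms hop spans roughly 100 ms
    d1 = librosa.feature.delta(mfcc, width=7, order=1)
    d2 = librosa.feature.delta(mfcc, width=7, order=2)
    return np.vstack([mfcc, d1, d2])        # (39, n_frames) feature matrix
```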
A 78-MFCC feature uses 26 MFCC coefficients instead, giving a 78-dimensional vector. The LFMAD feature is described in section 3. The previously reported [1] mixed feature, which combines 39-MFCC and LFMAD, was also used in the tests, together with some of the best of the LFMAD features, namely LFMAD-4 (the same as 4 Hz ASD) and LFMAD-5. All tests were performed using a GMM-32 (Gaussian Mixture Model with 32 mixture components); a sketch of this classification stage is given after section 3.1 below.

3. Choice of LF component

3.1 Feature extraction

The features were extracted from the LF modulation amplitude and its standard deviation in 20 critical bands, giving a 40-dimensional vector, referred to as LFMAD, in the same way as the 4 Hz ASD [1]. This means passing the signal through a critical-band filter bank (20 bands), rectifying, low-pass filtering at 28 Hz, normalising by the long-term average, and finally extracting the log power of the desired low frequency component, calculated by FFT. The size of the analysis window was varied from 37.5 ms for the 27 Hz feature up to 1 second. The standard deviation of each amplitude was calculated over 20 overlapping windows with an increment of 12.5 ms, giving a decision window size that varies from approximately 290 to 1250 ms. The influence of the analysis window size was examined by computing LFMAD-n, with n = 2 to 27, and testing the discrimination ability. Some of these features were also combined to see whether this could improve the results.
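A minimal sketch of this extraction chain is given below. The Bark-style band edges, the filter orders and the pooling of the 20 per-window values into the amplitude term are assumptions; only the pipeline steps (critical-band filtering, rectification, 28 Hz low-pass, long-term-average normalisation, FFT log power) and the 12.5 ms increment come from the paper.

```python
# A sketch of LFMAD-n extraction (not the author's code). Band edges,
# filter orders and amplitude pooling are assumptions.
import numpy as np
from scipy.signal import butter, sosfilt

# Approximate critical-band (Bark) edges in Hz: 21 edges -> 20 bands
BARK_EDGES = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
              1480, 1720, 2000, 2320, 2700, 3150, 3700, 4320, 5080, 6000]

def lfmad(signal, sr=16000, n_hz=4, n_windows=20, hop_s=0.0125):
    """One 40-dim LFMAD-n vector: per-band n Hz modulation log power
    (pooled over 20 overlapping windows) plus its standard deviation."""
    win = int(sr / n_hz)               # analysis window = one period of n Hz
    hop = int(hop_s * sr)              # 12.5 ms window increment
    assert len(signal) >= win + (n_windows - 1) * hop, "signal too short"
    lowpass = butter(4, 28, btype="low", fs=sr, output="sos")
    amps, devs = [], []
    for lo, hi in zip(BARK_EDGES[:-1], BARK_EDGES[1:]):
        band = butter(2, [lo, hi], btype="band", fs=sr, output="sos")
        env = sosfilt(lowpass, np.abs(sosfilt(band, signal)))  # rectify + LP
        env = env / (np.mean(env) + 1e-12)  # long-term average normalisation
        # Log power of the n Hz component in each window; with a window of
        # exactly one period, n Hz falls in FFT bin 1.
        logp = [np.log(np.abs(np.fft.rfft(env[k*hop : k*hop + win])[1])**2
                       + 1e-12) for k in range(n_windows)]
        amps.append(np.mean(logp))     # amplitude term (pooling is assumed)
        devs.append(np.std(logp))      # deviation term
    return np.array(amps + devs)       # 20 amplitudes, then 20 deviations
```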
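The classification stage of section 2 can be sketched in the same spirit: a 32-component GMM per class, a log-likelihood-ratio score, and a threshold chosen on the EER data. Here scikit-learn's GaussianMixture stands in for whatever GMM tooling the author used; the diagonal covariances and the helper names are assumptions.

```python
# A sketch of the GMM-32 classification stage (scikit-learn stands in
# for the original tooling; diagonal covariances are an assumption).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features):
    """Fit a 32-component GMM to training vectors (n_frames x n_dims)."""
    return GaussianMixture(n_components=32, covariance_type="diag",
                           max_iter=200, random_state=0).fit(features)

def llr_score(gmm_speech, gmm_music, features):
    """Mean log-likelihood ratio of a segment; above threshold -> speech."""
    return np.mean(gmm_speech.score_samples(features)
                   - gmm_music.score_samples(features))

def eer_threshold(speech_scores, music_scores):
    """Threshold on the EER set where the two error rates are closest:
    speech misses (score < t) vs. music false alarms (score >= t)."""
    cands = np.sort(np.concatenate([speech_scores, music_scores]))
    gap = [abs(np.mean(speech_scores < t) - np.mean(music_scores >= t))
           for t in cands]
    return cands[int(np.argmin(gap))]
```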
Related papers
Discrimination between speech and music based on a low frequency modulation feature
The possibility of discriminating between speech and music signals by using a feature based on low frequency modulation has been investigated. Three different low frequency modulation parameters were extracted and tested for their discrimination ability. The low frequency modulation amplitudes calculated over 20 critical bands and their standard deviations were found to be good features ...
New Warped LPC-Based Feature for Fast and Robust Speech/Music Discrimination
Automatic discrimination of speech and music is an important tool in many multimedia applications. The paper presents a low complexity but effective approach for speech/music discrimination, which exploits only one simple feature, called Warped LPC-based Spectral Centroid (WLPC-SC). A three-component Gaussian Mixture Model (GMM) classifier is used because it showed a slightly better performance...
Speech/Music Discrimination Using a Single Warped LPC-Based Feature
Automatic discrimination of speech and music is an important tool in many multimedia applications. The paper presents a low complexity but effective approach for speech/music discrimination, which exploits only one simple feature, called Warped LPC-based Spectral Centroid (WLPC-SC). A three-component Gaussian Mixture Model (GMM) classifier is used because it showed a slightly better performance...
Speech-Music Discrimination from MPEG-1 Bitstream
This paper describes a proposed algorithm for speech/music discrimination which works on data taken directly from the MPEG-encoded bitstream, thus avoiding the computationally expensive decoding-encoding process. The method is based on thresholding features derived from the modulation envelope of the frequency-limited audio signal. The discriminator is tested on more than 2 hours of audio data, ...
Automatic Music Genre Classification
Nowadays, automatic analysis of music signals has gained considerable importance due to the growing amount of music data found on the Web. Music genre classification is one of the interesting research areas in music information retrieval systems. In this paper several techniques were implemented and evaluated for music genre classification, including feature extraction, feature selection and m...
Publication year: 2002